ABSTRACT

The detection and characterization of filamentary structures in the cosmic web allows
cosmologists to constrain parameters that dictate the evolution of the Universe.
While many filament estimators have been proposed, they generally lack estimates 
of uncertainty, reducing their inferential power. In this paper, we demonstrate how 
one may apply the Subspace Constrained Mean Shift (SCMS) algorithm (Ozertem 
& Erdogmus 2011, Genovese et al. 2014) to uncover filamentary structure in galaxy 
data. The SCMS algorithm is a gradient ascent method that models filaments as 
density ridges, one-dimensional smooth curves that trace high-density regions within 
the point cloud. We also demonstrate how augmenting the SCMS algorithm with 
bootstrap-based methods of uncertainty estimation allows one to place uncertainty 
bands around putative filaments. We first apply SCMS to a dataset generated
from the Voronoi model; the resulting density ridges show strong agreement with the
filaments of the Voronoi model. We then apply SCMS to datasets sampled from a P3M
N-body simulation, with galaxy number densities consistent with SDSS and WFIRST-AFTA,
and to LOWZ and CMASS data from the Baryon Oscillation Spectroscopic
Survey (BOSS). To further assess the efficacy of SCMS, we compare the relative locations
of BOSS filaments with galaxy clusters in the redMaPPer catalog, and find that
redMaPPer clusters are significantly closer (with p-values < 10“^) to SCMS-detected 
filaments than to randomly selected galaxies. 

Key words: cosmology: observations - large-scale structure of the Universe - methods:
data analysis - methods: statistical


1 INTRODUCTION 

Observations of the local universe made over the last four
decades show that, on megaparsec scales, matter is distributed
in web-like structures (clusters, filaments, sheets,
and voids) that arise naturally from the non-linear evolution
of initially small density fluctuations (Peebles 1980; Bond
et al. 1996; Jenkins et al. 1998; Colberg et al. 2005; Springel
et al. 2005; Dolag et al. 2006). Of particular interest to us
are the filaments, one-dimensional structures that connect
galaxy clusters and form at the boundaries of empty voids.
Filaments are of interest for several reasons. The detection
and characterization of filaments at a range of redshifts provides
a means by which cosmologists can constrain theories
of the universe's evolution (Bond et al. 1996; Zhang et al.
2009, 2013). Filaments also influence the shape, angular momentum,
and peculiar velocities of dark matter haloes (Hahn
et al. 2007b,a; Paz et al. 2008; Hahn et al. 2009; Zhang et al.
2009; Jones et al. 2010; Zhang et al. 2013; Forero-Romero
et al. 2014), as well as the intrinsic alignments and luminosities
of nearby galaxies (Guo et al. 2015; Clampitt et al.
2014; Codis et al. 2014).

As the review of Cautun et al. (2014) amply demonstrates,
the detection of filamentary structure is a non-trivial
problem for which many solutions have been proposed.
These solutions include methods that examine the
Hessian matrix of the galaxy density field, such as the Multiscale
Morphology Filter (MMF; Aragon-Calvo et al. 2007,
2010a) and NEXUS and NEXUS+ (Cautun et al. 2013),
as well as segmentation-based methods, such as the Candy
model (Stoica et al. 2005, 2007), the skeleton
(Novikov et al. 2006), the Spine method (Aragon-Calvo et al.
2010b), DisPerSE (Sousbie 2011), and the path
density method (Genovese et al. 2009). While all of these
methods provide estimates of filamentary structure, none
provides an assessment of estimator uncertainty. The fact
that filament estimates are random sets presents a significant
challenge to the construction of valid uncertainty measures
(Molchanov 2005).

In this paper, we introduce a new method for filament
detection based on the Subspace Constrained Mean Shift
(SCMS) algorithm of Ozertem & Erdogmus (2011). The
mathematical properties of density ridges are discussed in
Eberly (1996), and the statistical consistency of SCMS is
discussed in Genovese et al. (2014) and Chen et al. (2014a).
Chen et al. (2014a) also introduce an uncertainty
measure to the ridge formalism that allows one to
quantitatively assess, in the context of the current paper,
putative cosmic filaments.

In §2, we describe the SCMS algorithm and the methods
we use to assess the uncertainty of its filament estimates. In
§3, we apply SCMS, first to a P3M N-body simulation output
(Trac et al. 2015), and then to low-redshift
($0.235 \le z \le 0.240$) and high-redshift ($0.530 \le z \le 0.535$)
data collected by the Baryon Oscillation Spectroscopic Survey (BOSS)
and released as part of SDSS Data Release 11. We also
demonstrate the consistency between filaments detected by
SCMS and galaxy clusters listed in the redMaPPer catalog.
In §4 we summarize our results and offer possible avenues
for future methodological development. In Appendix A we
provide further detail on how to optimally select values for
the tuning parameters of the SCMS algorithm, while in
Appendix B we apply the algorithm to labeled simulated data
generated via the Voronoi model of van de Weygaert (1994)
to show that it preferentially detects structures labeled as
filaments. In a second paper, we will provide a full catalogue
of filaments detected in SDSS data.


2 SUBSPACE CONSTRAINED MEAN SHIFT: 
ALGORITHM 

2.1 Density Ridge Formalism 

Assume that we observe $n$ galaxies with locations
$X_1, \cdots, X_n$ that are $d$-dimensional points; for data from
typical astronomical surveys, $d = 2$ (if the galaxies are
constrained to a redshift shell) or $d = 3$. We model $X_1, \cdots, X_n$
as random variables sampled from an unknown density
function $p$.

Formally, a density ridge (Eberly 1996; Ozertem & Erdogmus
2011; Genovese et al. 2014; Chen et al. 2014a,b)
of $p$ is defined as follows. Let $g(x) = \nabla p(x)$ and $H(x)$
be the gradient and Hessian, respectively, of $p(x)$, and let
$v_1(x), \cdots, v_d(x)$ be the eigenvectors of the Hessian matrix,
with associated eigenvalues $\lambda_1(x) \ge \lambda_2(x) \ge \cdots \ge \lambda_d(x)$.
We define $V(x)$ to be the matrix of all eigenvectors orthogonal
to the first, $[v_2(x), \cdots, v_d(x)]$, and the ridge set $R$ as

$$R = \mathrm{Ridge}(p) = \{x : G(x) = 0, \ \lambda_2(x) < 0\}, \qquad (1)$$

where

$$G(x) = V(x) V(x)^T g(x) \qquad (2)$$

is the projected gradient. The fact that ridges have a
projected gradient of 0 (and a negative second eigenvalue)
means that ridges are local maxima in the subspace
spanned by the eigenvectors $v_2(x), \cdots, v_d(x)$. When $p$ is smooth
and the eigengap

$$\beta(x) = \lambda_1(x) - \lambda_2(x) \qquad (3)$$

is positive, the ridges have the properties of filaments,
i.e. smooth curve-like structures with high density (see
Figure 1). Note that $R$ will include modes of the density $p$ which,
in the context of cosmic filament detection, means that $R$
contains both filaments and galaxy clusters. Also note that
density ridges are more general objects than the skeleton
models proposed in Novikov et al. (2006) and Sousbie et al.
(2008) and the Spine method (Aragon-Calvo et al. 2010b).
Essentially, when $d = 2$ or 3, density ridges are the same as
skeletons.
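
The ridge conditions in equations (1)-(3) can be checked by hand on a toy density. The example below is only an illustrative sketch: it assumes a bivariate Gaussian with independent coordinates and standard deviations $\sigma_1 > \sigma_2$, a density chosen for tractability and not one of the datasets analysed in this paper. Under that assumption, every point of the long axis satisfies the ridge conditions.

```latex
% Assumed toy density (not from the paper): independent bivariate Gaussian, \sigma_1 > \sigma_2.
\begin{align*}
p(x) &= \frac{1}{2\pi\sigma_1\sigma_2}
        \exp\!\left(-\frac{x_1^2}{2\sigma_1^2}-\frac{x_2^2}{2\sigma_2^2}\right),
\qquad
g(x) = -p(x)\begin{pmatrix} x_1/\sigma_1^2 \\ x_2/\sigma_2^2 \end{pmatrix}, \\
H(x) &= p(x)\begin{pmatrix}
  x_1^2/\sigma_1^4 - 1/\sigma_1^2 & x_1 x_2/(\sigma_1^2\sigma_2^2) \\
  x_1 x_2/(\sigma_1^2\sigma_2^2) & x_2^2/\sigma_2^4 - 1/\sigma_2^2
\end{pmatrix}. \\
\intertext{On the axis $x_2 = 0$ the Hessian is diagonal, and because
$x_1^2/\sigma_1^4 - 1/\sigma_1^2 \ge -1/\sigma_1^2 > -1/\sigma_2^2$ we have}
\lambda_1(x) &= p(x)\left(\frac{x_1^2}{\sigma_1^4}-\frac{1}{\sigma_1^2}\right),
\quad v_1 = \begin{pmatrix}1\\0\end{pmatrix},
\qquad
\lambda_2(x) = -\frac{p(x)}{\sigma_2^2} < 0,
\quad v_2 = \begin{pmatrix}0\\1\end{pmatrix}, \\
G(x) &= v_2 v_2^T g(x)
      = \begin{pmatrix} 0 \\ -p(x)\,x_2/\sigma_2^2 \end{pmatrix}
      = 0 \quad \text{when } x_2 = 0, \\
\beta(x) &= \lambda_1(x)-\lambda_2(x)
      = p(x)\left(\frac{x_1^2}{\sigma_1^4}+\frac{1}{\sigma_2^2}-\frac{1}{\sigma_1^2}\right) > 0 .
\end{align*}
```

So, for this assumed density, the whole $x_1$-axis belongs to $R$, including the mode at the origin, which matches the remark that $R$ contains modes as well as curve-like structures.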

Compared with other models, density ridges use information
from both the gradient and the Hessian matrix of the density.
In contrast, MMF (Aragon-Calvo et al. 2007, 2010a),
NEXUS and NEXUS+ (Cautun et al. 2013) only use the 
information of second derivatives (they define filaments as 
the regions with A 2 (x) < 0 and Ai(x) A 2 (x) > A 3 (x)). 
DisPerSE models (Sousbie 2011) define filaments as those
gradient flows that start from saddle points and end at
local maxima, and thus utilize only the first derivatives.

An attractive feature of the density ridge model is that
the statistical theory for consistently estimating density
ridges is well established (Genovese et al. 2014; Chen
et al. 2014a,b). We also use an N-body simulation to verify the
convergence of density ridges as we subsample different
numbers of galaxies (see Section 3.2).


2.2 SCMS: Filament Detection 


The algorithm consists of three steps described below and
listed in Algorithm 1. The first is to estimate the underlying
density function $p(x)$ given $X_1, \cdots, X_n$, the observed
locations of galaxies. We use the standard kernel density
estimator (see e.g. Wasserman 2006):

$$\hat p(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\left(\frac{\|x - X_i\|}{h}\right), \qquad (4)$$

where $K(\cdot)$ is the smoothing kernel (e.g. a Gaussian),
$\|x - X_i\|$ is the Euclidean distance between the point $x$ and the
galaxy location $X_i$, and $h$ is the smoothing bandwidth
(the selection of which is discussed in Appendix A).

In the second step, we denoise by applying a threshold
to the estimated density function $\hat p(x)$ to eliminate the effect
that galaxies in low-probability density regions, i.e. where
$\hat p(x) < \tau$, would have on filament estimation. How one
selects $\tau$ is also discussed in Appendix A. The denoising step
is not part of the original SCMS algorithm but is important
for increasing its statistical power in low-density regions
(see Figure 3). We note that a thresholding step is included
in several filament-detection algorithms, including those of
Novikov et al. (2006) and Sousbie (2011).
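
The first two steps can be sketched in a few lines of dependency-free Python. This is a minimal illustration rather than the implementation used in the paper: it assumes a Gaussian kernel normalised in $d$ dimensions, and the function names `gaussian_kde` and `apply_threshold` are hypothetical.

```python
import math


def gaussian_kde(x, data, h):
    """Kernel density estimate at x (equation 4), assuming a Gaussian kernel.

    x    : sequence of d coordinates
    data : list of galaxy positions, each a sequence of d coordinates
    h    : smoothing bandwidth, in the same units as the coordinates
    """
    d = len(x)
    norm = (2.0 * math.pi) ** (-d / 2.0)  # standard Gaussian normalisation in d dimensions
    total = 0.0
    for xi in data:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        total += norm * math.exp(-0.5 * sq_dist / h ** 2)
    return total / (len(data) * h ** d)


def apply_threshold(mesh, data, h, tau):
    """Denoising step: keep only mesh points whose estimated density is at least tau."""
    return [m for m in mesh if gaussian_kde(m, data, h) >= tau]


# Toy data (made up for illustration): three clustered points and one isolated point.
galaxies = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
dense_points = apply_threshold(galaxies, galaxies, h=0.5, tau=0.2)
```

In this toy configuration the isolated point has the lowest estimated density and is the one removed by the threshold.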

For the final step, given a set of galaxies in high-density
regions, we apply the original version of SCMS (Ozertem
& Erdogmus 2011) to detect filamentary structures. Given
a point $x$ on a defined, uniform mesh, SCMS moves it according
to an “estimated projected gradient” given by

$$\hat G(x) = \hat V(x) \hat V(x)^T \hat g(x), \qquad (5)$$

where $\hat V(x)$ and $\hat g(x)$ are estimates of the quantities $V(x)$ and $g(x)$
that we define above in Section 2.1. One may view this procedure
as estimating a ridge set $R$ by applying the Ridge
operator to $\hat p$:

$$\hat R = \mathrm{Ridge}(\hat p). \qquad (6)$$

Essentially, $\hat R$ is very similar to the filaments defined in Sousbie
et al. (2008), Bond et al. (2010), and Choi et al. (2010). Note
that a putative filament is, in the context of this algorithm,
a set of points and not a one-dimensional curve. In Step 4 of
Algorithm 1, we further describe how we apply SCMS. In
Figure 2, we illustrate the application of SCMS to a uniform
mesh of points, and in Figure 3 we demonstrate the importance
of the thresholding step: the left and right panels show
putative filaments detected without and with thresholding,
respectively. We observe that thresholding greatly decreases
the rate of false filament detection.

Figure 1. Examples of ridges (blue curves) in a smooth function. [Panels (a) and (b).]

Figure 2. Pictorial overview of the SCMS algorithm (Step 4 in Algorithm 1). Each point in an initially uniform mesh (the blue dots
in the top-left panel) is moved to the closest density ridge (bottom right). The top-middle, top-right, bottom-left, bottom-middle, and
bottom-right panels indicate the locations of the mesh points after 1, 2, 4, 8, and 16 iterations of the algorithm, respectively.

Figure 3. An example of the comparison of SCMS with and without noise removal. [Panels: (a) without thresholding; (b) with thresholding.]
This is a simple simulated dataset with clutter noise. Thresholding the density removes the spurious filaments caused by clutter noise.

Figure 4. An example of the application of SCMS. (a) The original data. (b) Contour plot showing the kernel density estimate of the
density $p$. (c) The ridge estimate (blue curve). Note that in (b), we remove points where the estimated density is less than a threshold $\tau$.
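
The three steps above can be sketched as follows. This is an illustrative, dependency-free sketch under stated assumptions, not the code used for the results in this paper: it is restricted to $d = 2$, assumes a Gaussian kernel, uses the closed-form eigenvector of a symmetric 2x2 matrix in place of a general spectral decomposition, and the names `scms_step`, `density`, and `scms` are hypothetical. It also assumes every retained mesh point has non-negligible kernel weight, which the thresholding step is meant to ensure.

```python
import math


def scms_step(x, data, h):
    """One subspace constrained mean shift update of a 2-D point x (Gaussian kernel)."""
    c, mu = [], []
    for xi in data:  # kernel weights c_i and scaled offsets mu_i = (x - X_i) / h^2
        dx, dy = x[0] - xi[0], x[1] - xi[1]
        c.append(math.exp(-0.5 * (dx * dx + dy * dy) / h ** 2))
        mu.append((dx / h ** 2, dy / h ** 2))
    # Hessian of the density estimate, up to a positive constant factor.
    hxx = sum(ci * (m[0] * m[0] - 1.0 / h ** 2) for ci, m in zip(c, mu))
    hxy = sum(ci * m[0] * m[1] for ci, m in zip(c, mu))
    hyy = sum(ci * (m[1] * m[1] - 1.0 / h ** 2) for ci, m in zip(c, mu))
    # theta is the direction of the eigenvector with the larger eigenvalue;
    # v2, perpendicular to it, belongs to the smaller eigenvalue.
    theta = 0.5 * math.atan2(2.0 * hxy, hxx - hyy)
    v2 = (-math.sin(theta), math.cos(theta))
    # Mean shift vector, projected onto v2.
    csum = sum(c)
    mx = sum(ci * xi[0] for ci, xi in zip(c, data)) / csum - x[0]
    my = sum(ci * xi[1] for ci, xi in zip(c, data)) / csum - x[1]
    step = v2[0] * mx + v2[1] * my
    return (x[0] + step * v2[0], x[1] + step * v2[1])


def density(x, data, h):
    """Gaussian kernel density estimate in two dimensions."""
    s = sum(math.exp(-0.5 * ((x[0] - xi[0]) ** 2 + (x[1] - xi[1]) ** 2) / h ** 2)
            for xi in data)
    return s / (2.0 * math.pi * len(data) * h ** 2)


def scms(data, h, tau, mesh=None, tol=1e-7, max_iter=500):
    """Threshold the mesh, then move every surviving point until it stops moving."""
    mesh = list(data) if mesh is None else list(mesh)  # default mesh: the data
    ridge = []
    for x in mesh:
        if density(x, data, h) < tau:  # thresholding
            continue
        for _ in range(max_iter):
            new_x = scms_step(x, data, h)
            shift = math.hypot(new_x[0] - x[0], new_x[1] - x[1])
            x = new_x
            if shift < tol:
                break
        ridge.append(x)
    return ridge


# Toy data (made up): points scattered tightly around the line y = 0.
points = [(0.1 * i, 0.05 * (-1) ** i) for i in range(50)]
ridge_points = scms(points, h=0.5, tau=0.01, mesh=[(2.5, 0.3), (2.5, 10.0)])
```

In this toy run the mesh point far from the data is discarded by the threshold, and the remaining point is pulled onto the line traced by the data. The convergence tolerance `tol` and iteration cap `max_iter` are illustrative choices, not values taken from the paper.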

2.3 SCMS: Filament Uncertainty Estimation 

We quantify the uncertainty in the filament estimates produced
by SCMS using the concept of local uncertainty (Chen
et al. 2014a). The local uncertainty in an estimated filament
at a point $x$ on the true filament $R$ is the expected
(root-mean-square) distance between $x$ and the closest point to $x$ on $\hat R$.
This is denoted by $\rho(x)$ and is given by:

$$\rho(x) = \begin{cases} \sqrt{E\left[d_{\hat R}^2(x)\right]} & \text{if } x \in R \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$

where $d_{\hat R}(x) = \min\{\|x - y\| : y \in \hat R\}$ and the notation $E[\cdot]$
denotes the expected value operator. $\rho(x)$ is the radius of a
local confidence ball that surrounds the point $x$; the more
uncertain the true location of the estimated filament, the
larger the value of $\rho(x)$. We estimate $\rho(x)$, which is defined
as a function of the unknown density field $p$ and the unknown
filament set $R$, by utilizing bootstrap resampling.

In this paper, we consider both the original version
of the bootstrap (Efron 1979) and the smooth bootstrap. The
smooth bootstrap (see e.g. Silverman & Young 1987) is a variant
of the bootstrap that is useful in functional estimation
problems, in which the bootstrap sample is drawn from the
estimated density $\hat p$ instead of the original data. When the
smoothing kernel is a bivariate Gaussian, we generate the
smooth bootstrap sample via the following two steps:

1. Generate the bootstrap sample.

2. Add independent and identically distributed Gaussian
noise with variance $h^2$.

Unlike the bootstrap, the smooth bootstrap takes into account
both the variance and the bias of filament estimation,
but it estimates the variance less precisely than the
bootstrap does.
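
The two resampling schemes can be sketched as below. This is a minimal illustration, assuming a Gaussian kernel so that drawing from the kernel density estimate amounts to resampling the data and adding noise with standard deviation $h$ to each coordinate; the function names are hypothetical.

```python
import random


def bootstrap_sample(data, rng):
    """Ordinary bootstrap: draw n points from the data with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def smooth_bootstrap_sample(data, h, rng):
    """Smooth bootstrap: resample, then add N(0, h^2) noise to every coordinate."""
    return [tuple(coord + rng.gauss(0.0, h) for coord in point)
            for point in bootstrap_sample(data, rng)]


# Toy usage with made-up positions and a fixed seed for reproducibility.
rng = random.Random(0)
galaxies = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]
plain = bootstrap_sample(galaxies, rng)
smooth = smooth_bootstrap_sample(galaxies, 0.1, rng)
```

Every point of `plain` is one of the original positions, whereas the points of `smooth` are jittered copies, which is what lets the smooth bootstrap reflect the smoothing bias as well as the sampling variance.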

Algorithm 1 SCMS (Subspace Constrained Mean Shift)

Input: Data $\{X_1, \cdots, X_n\}$. Smoothing bandwidth $h$. Threshold $\tau$.

Step 1. Compute the density estimator $\hat p(x)$ via equation (4).

Step 2. Select a mesh $\mathcal{M}$ of points. By default, we can take $\mathcal{M} = \{X_1, \cdots, X_n\}$.

Step 3. Thresholding: remove $m \in \mathcal{M}$ if $\hat p(m) < \tau$. Let the remaining mesh points be denoted $\mathcal{M}'$.

Step 4. For each $x \in \mathcal{M}'$, perform the following subspace constrained mean shift until convergence:

Step 4-1. For $i = 1, \cdots, n$, compute

$$\mu_i = \frac{x - X_i}{h^2}, \qquad c_i = K\left(\frac{x - X_i}{h}\right).$$

Step 4-2. Compute the Hessian matrix

$$H(x) = \frac{1}{n} \sum_{i=1}^{n} c_i \left(\mu_i \mu_i^T - \frac{1}{h^2} I\right). \qquad (7)$$

Step 4-3. Perform spectral decomposition on $H(x)$ and compute $V(x) = (v_2(x), \cdots, v_d(x))$, the eigenvectors
corresponding to the smallest $d - 1$ eigenvalues.

Step 4-4. Update $x \leftarrow V(x) V(x)^T m(x) + x$ until convergence, where

$$m(x) = \frac{\sum_{i=1}^{n} c_i X_i}{\sum_{i=1}^{n} c_i} - x \qquad (8)$$

is called the mean shift vector.

Output: The collection of all remaining points.


Figure 5. An illustration of the uncertainty measure for SCMS. [Panels: (a) local uncertainty; (b) uncertainty band.] In (a), we display
the uncertainty measure by color (red: highly uncertain). The unit of the color scale is the same as that of the X and Y axes. In (b), we
show the uncertainty measure as a gray region around the filament (blue). Note that SCMS estimates are more uncertain in highly curved
regions and near end points.


Assume we generate $B$ bootstrap samples, each denoted
$X_1^{*(b)}, \cdots, X_n^{*(b)}$ for $b = 1, \cdots, B$.
For each bootstrap sample, we compute
the density estimate $\hat p^{*(b)}$, the ridge estimate
$\hat R^{*(b)} = \mathrm{Ridge}(\hat p^{*(b)})$, and the confidence ball radii
$\hat\rho_{(b)}(x)$ for all $x \in \hat R$. We estimate $\rho(x)$ by adding the $B$
radius estimates in quadrature:

$$\hat\rho(x) = \sqrt{\frac{1}{B} \sum_{b=1}^{B} \hat\rho_{(b)}^2(x)}. \qquad (10)$$

In Algorithm 2 we outline the computational steps that one
must follow to derive $\hat\rho(x)$.

Note that calculating the uncertainty measure is not
part of the SCMS algorithm itself: we can detect filaments
without using the uncertainty measure. However, the
uncertainty measure is a distinctive feature of SCMS filaments.
It has a geometric interpretation and can be consistently
estimated; see Chen et al. (2014a) for a more involved
discussion. Note that other filament finders do not have
such a statistically consistent error measurement.
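
The bootstrap procedure of Algorithm 2 can be sketched generically. In the sketch below, `estimate_ridge` is a hypothetical callable standing in for any filament finder (for example an SCMS implementation) that maps a list of points to a non-empty list of ridge points; the function names and the toy estimator in the usage lines are illustrative assumptions, not part of the paper.

```python
import math
import random


def distance_to_set(x, points):
    """Distance from x to the nearest point of a finite point set."""
    return min(math.dist(x, y) for y in points)


def local_uncertainty(data, estimate_ridge, n_boot, rng):
    """Bootstrap estimate of the local uncertainty at each estimated ridge point."""
    ridge = estimate_ridge(data)                       # filaments from the full sample
    sq_sums = [0.0] * len(ridge)
    n = len(data)
    for _ in range(n_boot):
        sample = [data[rng.randrange(n)] for _ in range(n)]   # one bootstrap sample
        boot_ridge = estimate_ridge(sample)                   # filaments from the resample
        for j, x in enumerate(ridge):
            sq_sums[j] += distance_to_set(x, boot_ridge) ** 2
    rho = [math.sqrt(s / n_boot) for s in sq_sums]     # radii combined in quadrature
    return ridge, rho


def mean_line(sample):
    """Toy stand-in for a ridge finder: a horizontal line at the mean y of the sample."""
    y_bar = sum(p[1] for p in sample) / len(sample)
    return [(0.1 * i, y_bar) for i in range(11)]


toy_data = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (3.0, -0.2)]
ridge, rho = local_uncertainty(toy_data, mean_line, n_boot=20, rng=random.Random(0))
```

Passing the ridge finder as an argument keeps the uncertainty estimate separate from filament detection itself, mirroring the remark that the uncertainty measure is not part of the SCMS algorithm.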


Algorithm 2 Uncertainty Measure for SCMS

Input: Data $\{X_1, \cdots, X_n\}$. Smoothing bandwidth $h$. Threshold $\tau$.

Step 1. Perform SCMS on $\{X_1, \cdots, X_n\}$ to detect filaments; denote the estimated filaments by $\hat R$.

Step 2. Generate $B$ bootstrap samples: $X_1^{*(b)}, \cdots, X_n^{*(b)}$ for $b = 1, \cdots, B$.

Step 3. For each bootstrap sample, apply SCMS, which yields $\hat R^{*(b)}$ for $b = 1, \cdots, B$.

Step 4. For each $x \in \hat R$, calculate $\hat\rho_{(b)}(x) = d(x, \hat R^{*(b)})$, $b = 1, \cdots, B$.

Step 5. Compute $\hat\rho(x) = \left[\mathrm{mean}\{\hat\rho_{(1)}^2(x), \cdots, \hat\rho_{(B)}^2(x)\}\right]^{1/2}$.

Output: $\hat\rho(x)$.


2.4 SCMS: Boundary Bias

When computed with a kernel density estimator as in
equation (4), SCMS filament estimates suffer from boundary bias
within approximately two bandwidths of the edge of the observation
region. This is a systematic deviation from the true filament
caused by the density estimator averaging over a region
where no data can be observed, and it can degrade
the confidence band coverage probabilities near the boundary.
One remedy for boundary bias is to include additional
data immediately outside of the region of interest. Including
galaxies within $2h$ of the boundaries eliminates most
of the boundary bias, since very little of the volume under
a bivariate Gaussian kernel lies beyond that point. If one
cannot include additional data points outside the boundaries
(for instance, due to overall survey limits), then one
must be careful when interpreting filaments detected near
the boundaries.
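
The padding remedy can be sketched as a simple selection step. The sketch assumes a rectangular two-dimensional window and pads it by $2h$ on every side; the function names and the `pad_factor` parameter are hypothetical.

```python
def select_with_padding(data, x_range, y_range, h, pad_factor=2.0):
    """Keep galaxies inside the window enlarged by pad_factor * h on every side."""
    pad = pad_factor * h
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [p for p in data
            if x_lo - pad <= p[0] <= x_hi + pad and y_lo - pad <= p[1] <= y_hi + pad]


def inside_window(points, x_range, y_range):
    """Keep only points (for example estimated ridge points) inside the original window."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [p for p in points if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi]


# Toy usage with made-up positions: window [0, 10] x [0, 10], bandwidth h = 1.
catalog = [(0.0, 0.0), (5.0, 5.0), (-0.5, 2.0), (-3.0, 2.0), (11.5, 9.0)]
padded = select_with_padding(catalog, (0.0, 10.0), (0.0, 10.0), h=1.0)
interior = inside_window(padded, (0.0, 10.0), (0.0, 10.0))
```

One would run the filament finder on `padded` and then report only the ridge points that `inside_window` retains, so that the padding informs the density estimate without extending the region of interest.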


2.5 Filament Coverage 


Here we introduce some useful geometric concepts about
coverage. Given two sets $A$ and $B$, the coverage of $B$ by
$A$ is defined as

$$\mathrm{Cov}_B(A) = \frac{\text{Number of points in } (A \cap B)}{\text{Number of points in } B}. \qquad (11)$$

Note that when $A$ and $B$ are curves, they contain infinitely
many points. In this case, we replace ‘number of
points in’ by ‘the length of’. Similarly, we can define the
coverage of $A$ by $B$ as $\mathrm{Cov}_A(B)$.

Given two collections of filaments $R_1$ and $R_2$, both
are curves, so in general they do not intersect each other
and their coverage is 0. Thus, instead of directly
computing their coverage, we consider a flattened version of $R_1$
(and of $R_2$, respectively). We define

$$R_1 \oplus r = \{x : d(x, R_1) \le r\} \qquad (12)$$

as the $r$-flattened set of $R_1$. Then we define the coverage of $R_2$
by $R_1 \oplus r$ as a function of $r$ as

$$\mathrm{Cov}_{R_2}(r; R_1) = \frac{\text{Number of points in } (R_2 \cap (R_1 \oplus r))}{\text{Number of points in } R_2}. \qquad (13)$$

Similarly, we can define $\mathrm{Cov}_{R_1}(r; R_2)$. The two functions
$\mathrm{Cov}_{R_1}(r; R_2)$ and $\mathrm{Cov}_{R_2}(r; R_1)$ contain information about
the similarity between $R_1$ and $R_2$.

In simulation, we are able to define true filaments, say
$R_{\rm true}$, and we have an estimated filament, denoted $\hat R_n$.
We call the quantity $\mathrm{Cov}_{R_{\rm true}}(r; \hat R_n)$ the true positive
coverage (the ratio of true filaments covered by estimated
filaments), and we call $1 - \mathrm{Cov}_{\hat R_n}(r; R_{\rm true})$ the false positive
coverage ($\mathrm{Cov}_{\hat R_n}(r; R_{\rm true})$ is the ratio of the estimated filament
covered by the truth, so 1 minus this ratio is the ratio
of false positives). See Figure 9 for an example of true positive
and false positive coverage.
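
For filaments represented as finite point sets, the coverage functions above can be sketched directly. This is a minimal illustration in which the curves are assumed to be discretised into points, so that "number of points" stands in for length; the function names are hypothetical.

```python
import math


def coverage(target, reference, r):
    """Fraction of target points lying within distance r of the reference point set."""
    covered = sum(1 for x in target
                  if min(math.dist(x, y) for y in reference) <= r)
    return covered / len(target)


def true_false_positive(true_ridge, est_ridge, r):
    """True positive and false positive coverage at flattening radius r."""
    tp = coverage(true_ridge, est_ridge, r)        # truth covered by the estimate
    fp = 1.0 - coverage(est_ridge, true_ridge, r)  # estimate not covered by the truth
    return tp, fp


# Toy usage with made-up point sets.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
estimate = [(0.0, 0.1), (1.0, 0.1), (2.0, 2.0)]
tp, fp = true_false_positive(truth, estimate, r=0.5)
```

Here two of the four true points lie within 0.5 of the estimate, and one of the three estimated points lies farther than 0.5 from the truth, so `tp` is 0.5 and `fp` is one third.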


Combining the uncertainty measures and the coverage,
we can study the properties of the uncertainty band. An
uncertainty band for a detected filament is simply the union of
the confidence balls computed for each point on the filament,
i.e.

$$U(k) = \hat R \oplus k\hat\rho = \bigcup_{x \in \hat R} B(x, k\hat\rho(x)), \qquad (14)$$

where $B(x, r) = \{y : \|x - y\| \le r\}$ represents the set of
points within a ball centered at $x$ and with radius $r$. Denote
the region within the uncertainty band as $A$. The coverage
for $A$ is then

$$\mathrm{FCov}(A) = \mathrm{Cov}_{R_{\rm true}}(A) = \frac{\text{Number of points in } (A \cap R_{\rm true})}{\text{Number of points in } R_{\rm true}}.$$

One can think of $\mathrm{FCov}(A)$ as the true positive coverage using
a set $A$. For instance, if $\mathrm{FCov}(A) = 0.8$, then on average,
80% of the points on any given true filament lie within its
associated uncertainty band, and 20% lie outside the band.
This interpretation of coverage differs from the standard
interpretation of confidence band coverage, thus motivating
our use of the term “uncertainty band” instead of “confidence
band.” Figure 10 gives examples of the coverage for
uncertainty bands $U(k)$ with $k \in (0, 3)$ and $n = 250$ and
2500. As we observe in Figure 10, the coverage percentage
depends sensitively on the sample size $n$; thus, we cannot
provide simple rules for converting $k\sigma$ uncertainty bands to
coverage percentages.
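
The band coverage can be sketched for discretised filaments as follows. The sketch assumes the true filament and the estimated filament are given as point lists and that `rho` holds one estimated radius per estimated ridge point; `band_coverage` is a hypothetical name.

```python
import math


def band_coverage(true_ridge, est_ridge, rho, k):
    """Fraction of true-filament points inside the union of balls B(x, k * rho(x))."""
    inside = 0
    for t in true_ridge:
        if any(math.dist(t, x) <= k * radius for x, radius in zip(est_ridge, rho)):
            inside += 1
    return inside / len(true_ridge)


# Toy usage with made-up points and radii.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
estimate = [(0.0, 0.15), (1.0, 0.5), (2.0, 1.0)]
radii = [0.1, 0.3, 0.2]
curve = [band_coverage(truth, estimate, radii, k) for k in (1, 2, 6)]
```

Evaluating the function on a grid of `k` values gives a coverage curve of the kind plotted against `k` in Figure 10(a); by construction the curve cannot decrease as `k` grows.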


3 SUBSPACE CONSTRAINED MEAN SHIFT: 

APPLICATIONS 

3.1 Voronoi Dataset 

To show the effectiveness of capturing filaments, we compare 
the SCMS filaments (density ridges) to the filaments in the 
Voronoi model. The Voronoi model (van de Weygaert 1994) 
applies Voronoi tessellation to compute a density estimate 
for galaxies as well as the curvature of that estimate. Given a 
curvature estimate, the Voronoi method assigns a class label 
to each galaxy, indicating the type of large-scale structure 
to which to associate the galaxy. There are four possible 
classes: cluster, filament, wall, and void. 

We use the SCMS algorithm to analyze a simulated 
dataset (256^ galaxies, each with a class label, that span a 
$100 \times 100 \times 100$ Mpc$^3$ box) generated with the Voronoi model
(M. A. Aragon-Calvo, private communication). Figure 6
shows a comparison between our density ridges (blue curves)
and galaxies with different class labels (brown: cluster; red:
filament; green: wall; pink: void). The two methods generate
remarkably similar results: Voronoi clusters (i.e. galaxies
labeled cluster) occur at the intersection points of density
ridges; Voronoi filaments surround the density ridges; and
Voronoi walls span surfaces on which the density ridges lie.

To further quantify the association between density
ridges and each Voronoi model class, we study their
projection distances onto each other. Note that the distribution
of projection distances is related to filament coverage;
further discussion of this may be found in Chen et al. (2015).
Figure 7 displays the distributions of projection distances.
In both panels, we see that the distribution for ridges
versus the Voronoi filaments peaks at distances $< 1\,h^{-1}$ Mpc.
This indicates that the density ridges and the Voronoi
filaments are very similar. On the other hand, the projection
distances from the density ridges increase as we consider
clusters, walls, and voids; the distributions exhibit
increasing positive skewness.

3.2 P3M N-body Simulation 

To further demonstrate the efficacy of SCMS, we apply it
to P3M N-body simulations from Trac et al. (2015), which
assume a $\Lambda$CDM cosmology with $\Omega_m = 0.3$, $\Omega_\Lambda = 0.7$,
$\Omega_b = 0.045$, $h = 0.7$, $\sigma_8 = 0.8$ and $n_s = 0.96$. Each side of the
simulation box is of length 1 Gpc/h, and each box contains $2048^3$
particles.

In Figure 8, we demonstrate that, as sample size
increases, SCMS outputs filament estimates that are closer to
the true filaments (defined by the true density function);
the uncertainty measures capture SCMS errors due to
sampling variability. We take a slice of the full simulation
data ($x, y \in [125, 375]$ Mpc/h and $z \in [100, 105]$ Mpc/h) and
smooth the data with smoothing bandwidth $h = 5$
(recommended by the selection rule in Appendix A with $A_0 = 0.4$)
to get the density function and the filaments (cyan curves).
Figure 8(a) shows a contour plot for the density function.
The original sliced data contains 88,406 points (gray dots).
We downsample to get three subsamples containing
250, 2500, and 10,000 particles, respectively. For each subsample (black
dots), we apply SCMS to detect filaments (blue curves).
The convergence seen in Figure 8 is further
quantified by the true positive and false positive coverage
plots in Figure 9.

Note that the sparsest subsample, $n = 250$, has a galaxy
number density of $5.56 \times 10^{-4}$ Mpc$^{-3}$, which is similar to the
number density observed in SDSS CMASS data ($\sim 4 \times 10^{-4}$
Mpc$^{-3}$). The future Wide-Field Infrared Survey Telescope
(WFIRST)¹, a NASA mission with science objectives
in exoplanet exploration, dark energy research and galactic
and extragalactic surveys, will observe a number density
similar to that of the $n = 2500$ subsample ($\sim 5.56 \times 10^{-3}$ Mpc$^{-3}$).

We show the uncertainty measures and filament coverage
for $n = 2500$ in Figure 10. We plot filament coverage for
confidence regions $U(k)$ for $k \in (0, 3)$ in Figure 10(a), where
$n = 250$ and 2500, and where $\rho$ is estimated by both the
bootstrap (BT) and the smooth bootstrap (SB). This range
contains sample sizes that are in line with both CMASS
($n \sim 250$) and WFIRST ($n \sim 2500$) data. We observe that
filament coverage is, as noted above, sensitive to the sample
size $n$ and that the smooth bootstrap provides considerably
more conservative confidence bands, particularly for $k < 2$.
The gray regions displayed in Figure 10(b) are the smooth
bootstrap confidence regions $U(1)$, which we estimate
contain 85% of the true filaments (cyan curves).

¹ http://wfirst.gsfc.nasa.gov/

Figure 11 illustrates the effect of boundary bias in the
$n = 2500$ subsample by comparing the estimates and
uncertainties with padded and unpadded data near the boundary.
Panels (a) and (b) show the boundary bias. The
red curves are filaments estimated using only points
within the boundary (given by the orange rectangle). The
blue curves are filaments detected by SCMS when boundary
points (i.e. points outside the orange rectangle) are included. As can be
seen, the estimation of filaments without boundary data (red
curves) becomes more inaccurate as we approach the boundary.
The boundary bias occurs for filaments at distances of
less than 10 Mpc/h (2 times the smoothing parameter $h$) from the
boundaries. The uncertainty measures also show the
influence of boundary bias. Figures 11(c) and 11(d) exhibit the
uncertainty measures for filaments estimated with and without
boundary points. As expected, the uncertainty measures
in panel (d) increase as we move close to the boundary.

3.3 Sloan Digital Sky Survey 

3.3.1 Data 

We further demonstrate the efficacy of SCMS by applying
it to data from Data Release 12 (Alam et al. 2015) of the
Sloan Digital Sky Survey (SDSS; York et al. 2000). Together,
SDSS I, II (Abazajian et al. 2009), and III (Eisenstein et al.
2011) used a drift-scanning mosaic CCD camera (Gunn et al.
1998) to image over one third of the sky (14,555 square
degrees) in five photometric bandpasses (Fukugita et al. 1996;
Smith et al. 2002; Doi et al. 2010) to a limiting magnitude of
$r \simeq 22.5$, using the dedicated 2.5-m Sloan Telescope (Gunn
et al. 2006) located at Apache Point Observatory in New
Mexico. The imaging data were processed through a series
of pipelines that perform astrometric calibration (Pier et al.
2003), photometric reduction (Lupton et al. 2001), and
photometric calibration (Padmanabhan et al. 2008). All of the
imaging was reprocessed as part of SDSS Data Release 8
(Aihara et al. 2011).

The Baryon Oscillation Spectroscopic Survey (BOSS)
has obtained spectra and redshifts for 1.35 million galaxies
over a footprint covering 10,000 square degrees. These galaxies
are selected from the SDSS Data Release 8 imaging
and are being observed together with 160,000 quasars and
approximately 100,000 ancillary targets. The targets are
assigned to tiles of diameter 3° using a tiling
algorithm that is adaptive to the density of targets on the
sky (Blanton et al. 2003). Spectra are obtained using the
double-armed BOSS spectrographs (Smee et al. 2013). Each
observation is performed in a series of 900-second exposures,
integrating until a minimum signal-to-noise ratio is
achieved for the faint galaxy targets. This ensures a
homogeneous data set with a high redshift completeness of more
than 97 percent over the full survey footprint. Redshifts are
extracted from the spectra using the methods described in
Bolton et al. (2012). A summary of the survey design appears
in Eisenstein et al. (2011), and a full description is
provided in Dawson et al. (2013).

Figure 6. A comparison between density ridges and the Voronoi model. [Panels: (a) all galaxies; (b) galaxies with label = clusters;
(c) galaxies with label = filaments; (d) galaxies with label = walls; (e) galaxies with label = voids.] In each panel, the blue curves are
density ridges computed using all galaxies. Panels (b)-(e) display the comparison of density ridges to the Voronoi clusters, filaments,
walls and voids. In panel (c), we see a remarkable similarity between density ridges and the Voronoi filaments.

Figure 7. The distributions of projection distances from Voronoi-model-derived structures onto density ridges (left panel) and vice-versa
(right panel). Both panels indicate that density ridges trace structures most similar to Voronoi filaments.

BOSS selects two classes of galaxies to be targeted
for spectroscopy using SDSS Data Release 8 imaging:
‘LOWZ’ and ‘CMASS’ (we refer the reader to Anderson
et al. (2014) for further description of these classes). For the
LOWZ sample, the effective redshift is $z_{\rm eff} = 0.32$, slightly
lower than that of the SDSS-II luminous red galaxies (LRGs),
as we place a redshift cut $z < 0.43$. The CMASS selection
yields a sample with a median redshift $z = 0.57$ and a stellar
mass that peaks at $\log_{10}(M/M_\odot) = 11.3$ (Maraston et al.
2013). Most CMASS targets are central galaxies residing in 
dark matter haloes of mass ~ Mq. 

We test SCMS using two slices of data, at low and high
redshift. The low-z dataset comprises 1,158 galaxies in the
volume

$135^\circ \le \mathrm{RA} \le 175^\circ$, $5^\circ \le \delta \le 45^\circ$, $0.235 \le z \le 0.240$,

while the high-z dataset lies in the volume

$135^\circ \le \mathrm{RA} \le 175^\circ$, $5^\circ \le \delta \le 45^\circ$, $0.530 \le z \le 0.535$

and contains 4,678 galaxies. Both samples have a very thin
redshift range $\Delta z = 0.005$ (the corresponding comoving
distance is around 14 to 21 Mpc) so that their constituent galaxies
may be viewed as lying on a two-dimensional surface with
coordinates (RA, $\delta$).

Figure 8. A simulated example to show the consistency of SCMS. [Panels: (a) density field and true filaments; (b) $n = 250$ (CMASS),
density $5.56 \times 10^{-4}$ Mpc$^{-3}$; (c) $n = 2500$ (WFIRST), density $5.56 \times 10^{-3}$ Mpc$^{-3}$; (d) $n = 10000$, density
$2.22 \times 10^{-2}$ Mpc$^{-3}$.] This data is a slice of an N-body simulation in a box; the unit of
the X and Y axes is Mpc/h. We take a slice with width 5 Mpc/h. The original sample contains 88,406 particles. The color contour is the
galaxy density field from the original sample with smoothing parameter $h = 5$, and the true filaments (cyan) are density ridges of this
density field. We subsample under various sizes. The blue curves are estimated filaments based on the subsample (black dots). One can
see a clear pattern: as the sample size (for the subsample) increases, the estimated filaments are closer to the true filaments. See Section 3.2
for more details.

There are two principal reasons for our choice to perform
a two-dimensional analysis of the SDSS data. The first
is that there is too large a change in the number density of
detected galaxies over the SDSS redshift range. The SCMS
algorithm incorporates kernel density estimation (KDE) to locate
density ridges, and KDE requires a fixed smoothing parameter
$h$. However, in low-density regions, $h$ should be large to
obtain reliable results, while in high-density regions, $h$ has
to be small so as to not oversmooth the point cloud. The
second reason is that when $z > 0.2$, the number density is
very low, and performing a three-dimensional analysis will
produce results with large statistical errors due to the small
sample size. Lower-dimensional analyses result in decreased
statistical error (see e.g. Wasserman 2006).


3.3.2 Results 

We apply SCMS to the low-z data with smoothing bandwidth h = 2.50° (41.8 Mpc) and threshold level $\tau$ = 1.02 × 10“^; we display our results in Figure 12. For the high-z data, h = 2.03° (71.1 Mpc) and $\tau$ = 7.52 × 10“^; we display our results in Figure 13. Note that we have included additional galaxies within 5 degrees of our selected window to mitigate boundary bias.

As can be seen in Figures 12 and 13, SCMS filament estimates capture high-density regions and exhibit one-dimensional, nearly connected structures. In addition, SCMS yields smooth filaments; most filament estimators do not output such smooth structures (cf. Stoica et al. 2007; Stoica et al. 2005; Sousbie 2011; Aanjaneya et al. 2012; Lecci et al. 2013). We note that the filaments detected by SCMS will not actually connect with each other; points on merging filaments have an eigengap $\beta$ (equation 3) that tends toward 0, making the density ridge ill-defined since the first and second eigenvalues become equal. We note that in both figures there are possibly spurious filaments; for instance, in Figure 12, at (RA, $\delta$) = (165°, 40°) and (165°, 20°), we see filaments that are associated with a relatively small number of galaxies. As we demonstrate below, these putative filaments have higher estimates of uncertainty.


[Figure 9 panels: (a) True positive coverage (h = 5 Mpc); (b) False positive coverage (h = 5 Mpc).]

Figure 9. True positive and false positive coverage. As can be seen, for all distances, the true positive coverage increases with sample size, whereas the false positive coverage decreases.


[Figure 10 panels: (a) Filament coverage; (b) n = 2500, smooth bootstrap.]

Figure 10. Filament coverage based on the uncertainty measure. (a) The filament coverage FCov(U(k)) as a function of k (x-axis). We also provide the coverage for a Gaussian distribution (the probability of being within $k\sigma$ of the center of the Gaussian) as a reference. (b) Visualizing the uncertainty by color and a confidence set for the subsample with n = 2,500, with the uncertainty measure estimated via the smooth bootstrap. The cyan curves are the true filaments. Note that the gray regions are U(k) with k = 1, equivalent to the error bar for $1 \times \sigma$, based on the smooth bootstrap estimate. From panel (a), we know that the gray regions contain about 85% of the true filaments (cyan curves). The unit of the color scale in the uncertainty band is Mpc, the same as for the X and Y axes.

We derive the uncertainties for the filament estimators as described in Section 2.3 from the two test datasets; the results for the low-z and high-z samples are given in Figures 14 and 15, respectively. We visualize local uncertainty using color, where red indicates locations where the filamentary structure is highly uncertain. We also display uncertainty regions as bands of varying width (shown in gray) centered on the filaments. Our simulation study in Section 3.2 indicates that the filament coverage FCov for the regions in Figures 14(a) and 15(a) is ~45%, while that in Figures 14(b) and 15(b) is 60%. We find that the overall structure of the filaments in the high-z dataset is more stable than in the low-z data, due to the significantly larger size of the high-z dataset; as shown in Figure 8, sample size plays a crucial role in determining the size of the uncertainty regions associated with SCMS filament estimates.


[Figure 11 panel titles (n = 2500): "No Boundary Padding"; "Smooth Bootstrap"; (c) Uncertainty Measure with Boundary Points; (d) Uncertainty Measure without Boundary Points.]

Figure 11. Simulated example with sample size 2500 that demonstrates the boundary bias of SCMS. To demonstrate this bias, we remove points outside the orange rectangle (the so-called boundary points). (a) and (b): Comparison between SCMS with boundary points (blue) and SCMS without boundary points (red). As can be seen, the bias (the difference between the red and blue curves) is large for filaments whose distance to the boundary is less than 10 Mpc/h (2× the smoothing bandwidth h). (c) and (d): Uncertainty measure for the filaments with and without boundary points. Notice that in (d), filaments near the boundary tend to have much higher uncertainty. The unit of the color scale in the uncertainty band is Mpc, the same as for the X and Y axes.

As can be inferred from Figures 14 and 15, our measures of local uncertainty provide useful information for judging the quality of filament detections. We declare a point $x \in R$ to be 'unstable' if

$$\rho(x) \ge \bar{\rho} + 1.69\,\sigma_\rho, \qquad (16)$$

where $\bar{\rho}$ is the mean of the uncertainty over all filament points and $\sigma_\rho$ is the root mean square of the uncertainty. Namely, if the local uncertainty at x is too large, this point is not stable. The constant 1.69 comes from the width of a 90% confidence interval for a Gaussian distribution. For instance, the two filaments at (RA, $\delta$) = (165°, 40°) and (165°, 20°) in Figure 12 appear by eye to be spurious, given the relative lack of galaxies in their vicinity. Based on the uncertainty measures and our stability test (16), these filaments are declared unstable (yellow color in Figure 14).
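
As an illustration, the stability test (16) can be sketched in a few lines of Python. This is a minimal sketch, not the code used for the analysis: the function name `flag_unstable` and the toy uncertainty values are hypothetical, and we assume that $\sigma_\rho$ is the root-mean-square deviation of the uncertainties about their mean.

```python
import math

def flag_unstable(uncertainty, c=1.69):
    """Return one boolean per filament point: True where rho >= mean + c * sigma."""
    n = len(uncertainty)
    mean = sum(uncertainty) / n
    # Assumption: sigma_rho is the RMS deviation of the uncertainties about their mean.
    sigma = math.sqrt(sum((u - mean) ** 2 for u in uncertainty) / n)
    threshold = mean + c * sigma
    return [u >= threshold for u in uncertainty]

# Toy local-uncertainty values (made up for illustration); the last point is an outlier.
rho = [0.8, 1.0, 0.9, 1.1, 1.0, 0.9, 1.2, 1.0, 0.9, 4.5]
flags = flag_unstable(rho)
```

With these toy values only the last point exceeds the threshold and is flagged as unstable.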

3.3.3 Test Data: Comparison to redMaPPer Clusters

As one last demonstration of the efficacy of SCMS, we examine the consistency between our filament maps and the galaxy clusters in the redMaPPer catalog (Rykoff et al. 2014; Rozo & Rykoff 2014; Rozo et al. 2015). We make this comparison within the window

$$100^\circ \le \mathrm{RA} \le 270^\circ, \quad -10^\circ \le \delta \le 70^\circ$$

and within annuli of width $\Delta z = 0.005$ from $z_{\rm lo} = 0.100$ to $z_{\rm hi} = 0.500$ (a range that includes 10,602 galaxy clusters with spectroscopically determined redshifts, or 93.1% of the redMaPPer sample). Note that we also include SDSS DR7 main sample galaxies from the NYU VAGC (Blanton et al. 2005; Padmanabhan et al. 2008; Adelman-McCarthy et al. 2008) to detect filaments in the low-redshift regions (z < 0.25). We slice the data primarily for computational efficiency, since SCMS is an $O(n^2)$ algorithm, but slicing has the ancillary benefit of simplifying visualization. In total we examine 80 slices, each of which contains ~100 galaxy clusters. Within each slice, we determine optimal values of h and $\tau$ using the criteria described in Appendix A.


Figure 12. Application of SCMS to the low-z data (z = 0.235-0.240). The blue curves are filaments detected by SCMS.

Figure 13. Application of SCMS to the high-z data (z = 0.530-0.535). The blue curves are filaments detected by SCMS.


In Figure 16, we display SCMS-detected filaments along with redMaPPer clusters (in red). As can be seen, nearly all galaxy clusters are associated with detected filaments. Qualitatively similar results hold for all other slices. To quantify the association of galaxy clusters and filaments, we compare the distance to filaments for three types of objects: galaxy clusters, galaxies, and randomly generated points within the regions where galaxies are observed. We divide the whole redshift range z = 0.100-0.500 evenly into 8 sub-regions (each sub-region contains 10 slices); within each sub-region we compute distance statistics. Ideally, galaxy clusters should be systematically closer to filaments than galaxies are, and both galaxies and galaxy clusters should be far closer to filaments than randomly generated points. Figure 17 and Table 1 confirm this hypothesis. Figure 17 shows the cumulative distribution for these distance statistics. For a collection of values $X_1, \cdots, X_n$, the cumulative distribution function (CDF) is a non-decreasing function ranging from 0 to 1 defined as

$$F(x) = \frac{1}{n}\sum_{i=1}^{n} 1(X_i \le x). \qquad (17)$$
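
The empirical CDF in equation (17) is straightforward to evaluate numerically. The sketch below is illustrative only: the helper name `empirical_cdf` and the toy distances are hypothetical and do not come from the SDSS or redMaPPer data.

```python
from bisect import bisect_right

def empirical_cdf(values):
    """Return F, where F(x) is the fraction of values that are <= x (equation 17)."""
    ordered = sorted(values)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return F

# Toy distances-to-filament in degrees (made up for illustration).
cluster_dist = [0.1, 0.2, 0.2, 0.5]
F = empirical_cdf(cluster_dist)
```

Here `F(0.2)` evaluates to 0.75, since three of the four toy distances are at most 0.2. A sample whose distances are concentrated near zero has a CDF that rises steeply at small x, which is the behavior compared across object types in Figure 17.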

© 2015 RAS, MNRAS 000, 1-18

Cosmic Web Reconstruction through Density Ridges 13


[Figure 14 panels: (a) The bootstrap estimate; (b) The smooth bootstrap estimate. Horizontal axis: RA.]

Figure 14. Local uncertainty estimates for our low-z SDSS dataset (z = 0.235-0.240). We display the amount of uncertainty via color (red: high) and a confidence band in (a) and (b), using the ordinary bootstrap and the smooth bootstrap, respectively. The filament points surrounded by yellow are those with high uncertainty and are declared 'unstable'. Based on the simulation results in Figure 10, we expect the gray regions to contain about 50% of the true filaments in plot (a) and about 85% in plot (b). The unit of the color scale in the uncertainty band is degrees.


[Figure 15 panel: (b) The smooth bootstrap estimate.]

Figure 15. Local uncertainty estimates for our high-z SDSS dataset (z = 0.530-0.535). We display the amount of uncertainty via color (red: high) and a confidence band in (a) and (b), using the ordinary bootstrap and the smooth bootstrap, respectively. The filament points surrounded by yellow are those with high uncertainty and are declared 'unstable'. Based on the simulation results in Figure 10, we expect the gray regions to contain about 50% of the true filaments in plot (a) and about 85% in plot (b). The unit of the color scale in the uncertainty band is degrees.


Both galaxies (blue curves) and galaxy clusters (red curves) tend to be much closer to the filaments than random points; this suggests that galaxies and galaxy clusters are indeed concentrated around the detected filaments. When we compare galaxies and clusters, we observe that the distance distribution of galaxy clusters is much more right-skewed in the CDF plot for every redshift sub-region. That is, galaxy clusters tend to lie at smaller distances from filaments than a random galaxy does. We conduct the two-sample, one-sided KS test (Stephens 1974), which compares the distributions of distance statistics for galaxy clusters and randomly generated points, for all eight sub-regions. Table 1 shows the p-values, a statistical quantity measuring the significance of observations, for the eight KS tests that we carry out. A smaller p-value indicates stronger evidence for clusters being closer to a filament than galaxies. Typically, we declare significance when the p-value is less than 0.05. We observe an increasing trend in p-value as the redshift increases, due to the decrease in the number density of galaxies along the line of sight. The sharp reversal in this trend at the last sub-region (z = 0.450-0.500) is due to the large size of the CMASS sample at z > 0.430: the number density of galaxies in our sample actually increases from z = 0.430 to z = 0.500.
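
For readers who wish to reproduce this kind of comparison, a one-sided, two-sample KS test can be sketched as follows. This is a minimal, standard-library illustration rather than the code used here: the function name `ks_one_sided` and the toy distances are hypothetical, and the p-value uses the large-sample approximation $P(D^+ > d) \approx \exp(-2 d^2 nm/(n+m))$, which is only a rough guide for small samples. In practice a library routine such as `scipy.stats.ks_2samp` with its `alternative` argument would normally be preferred.

```python
import math

def ks_one_sided(sample_a, sample_b):
    """Return D+ = max_x [F_a(x) - F_b(x)] and an asymptotic one-sided p-value.

    A large D+ means sample_a is concentrated at smaller values than sample_b.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d_plus = 0.0
    # Walk through the pooled, sorted values and track the largest CDF excess.
    for x in sorted(set(a + b)):
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d_plus = max(d_plus, i / n - j / m)
    # Large-sample approximation (an assumption of this sketch, not an exact p-value).
    p_value = math.exp(-2.0 * d_plus ** 2 * n * m / (n + m))
    return d_plus, p_value

# Toy distances to the nearest filament, in degrees (made up for illustration).
cluster_dist = [0.1, 0.2, 0.3, 0.4]
random_dist = [1.0, 1.5, 2.0, 2.5]
d_plus, p_value = ks_one_sided(cluster_dist, random_dist)
```

Passing the cluster distances as the first sample tests whether their CDF lies above that of the comparison sample, i.e. whether clusters sit closer to filaments.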

Redshift       p-value            Redshift       p-value
0.100-0.150    4.38 ×             0.300-0.350    1.73 × 10-1®
0.150-0.200    1.01 × 10^-31      0.350-0.400    1.53 × 10^-19
0.200-0.250    1.66 × 10^-26      0.400-0.450    7.56 × 10-1"!
0.250-0.300    2.26 × 10^-19      0.450-0.500    1.95 × 10^-19

Table 1. Significances generated from a one-sided, two-sample KS test, for the null hypothesis that galaxy clusters lie at the same average distance from filaments as field galaxies. The p-value is a statistical quantity that measures the significance; the usual rejection rule requires p < 0.05. The p-values show strong evidence that clusters lie much closer to filaments than field galaxies.


Note that in Figure 16, many clusters appear to be located near the intersections of filaments. However, we do not construct a statistic to summarize this phenomenon, since defining the intersections of filaments detected by SCMS is a non-trivial problem. The main difficulty is the 'gap' between filaments: SCMS filaments do not intersect each other but are instead separated by a small gap. This gap can be explained by the density ridge model, in which we require the eigengap $\beta > 0$ (recall equation (3)) to ensure the properties of filaments. Therefore, when one ridge merges with another, the eigengap vanishes at some point (i.e. $\beta = 0$). This leaves a small gap between one ridge and another.


4 SUMMARY AND DISCUSSION 

In this paper, we demonstrate how one may apply the Subspace Constrained Mean Shift (SCMS) algorithm of Ozertem & Erdogmus (2011) to uncover filamentary structure in galaxy point cloud data. The density ridge model behind the SCMS algorithm ensures that galaxies will concentrate around detected filaments. In addition, we introduce an uncertainty measure for detected filaments that is based on the bootstrap, allowing us to study the significance of these filaments.

In §3 we first show that the SCMS filaments are very similar to the Voronoi filaments. Then we demonstrate the efficacy of our SCMS-based filament-finding algorithm by applying it both to P3M N-body simulation output and to SDSS DR12 data (including the NYU main sample galaxies and the LOWZ and CMASS datasets). By applying SCMS to simulated data, we are able to estimate the coverage of our bootstrap-generated uncertainty bands, i.e. the fraction of any one true filament that lies within its associated band (see Figure 10). We find that the coverage depends sensitively on the number of galaxies in an analyzed sample, with the smooth bootstrap algorithm generating more conservative uncertainty bands with $1\sigma$ coverage ≈ 0.6-0.8 (cf. 0.683 for a $1\sigma$ confidence band) for galaxy number densities × 10“'‘ - 5 × 10 ^ (densities observed or to be observed by SDSS CMASS and WFIRST, respectively).

In Figures 12-15, we show the results of applying the SCMS algorithm to SDSS spectroscopically observed galaxies in the redshift slices 0.235 ≤ z ≤ 0.240 and 0.530 ≤ z ≤ 0.535. To test the hypothesis that our estimated filaments are associated with real filamentary structures, we compare the distances between filaments and redMaPPer galaxy clusters, random field galaxies, and random points in the galaxy field. By using the one-sided, two-sample KS test, we find that we can safely reject the null hypothesis that galaxy clusters and field galaxies reside at similar distances from filaments; the p-values are < 10“® (cf. the usual rejection criterion that p < 0.05; see Table 1).

The SCMS algorithm models filaments as one-dimensional ridges that trace high-density regions within the point cloud; as such, SCMS may be grouped with other filament-modeling algorithms that use the eigenvalues and eigenvectors of the Hessian matrix associated with the point cloud density function, such as MMF (Aragon-Calvo et al. 2007, 2010a) and NEXUS/NEXUS+ (Cautun et al. 2013). However, in contrast to these methods, which output filament estimates as two-dimensional regions, SCMS filament estimates are smooth, one-dimensional curves; the filament orientations are well-defined. Also in contrast to these methods, we offer measures of uncertainty by augmenting the SCMS algorithm with bootstrap-based uncertainty estimation algorithms that allow one to e.g. place bands around putative filaments, whose relative sizes indicate uncertainty in filament location (as in e.g. Figure 5). We note that the segmentation-based DisPerSE algorithm of Sousbie (2011) uses the persistence ratio, a metric encapsulating the evolution of topological structure in the galaxy field, to define the significance of putative filaments, but not their spatial uncertainty. Finally, we compare SCMS filaments to those generated by the Spine (Aragon-Calvo et al. 2010b; Aragon-Calvo & Yang 2014) and Skeleton (Novikov et al. 2006) algorithms. Both the Skeleton and Spine models look for ridges within a density field. However, the Skeleton model does not provide a means by which to compute density ridges. In contrast, the SCMS algorithm allows us to efficiently compute ridges of the field's kernel density estimate. The Spine method outputs ridges as points on grids, so that resolution is an issue. On the other hand, the SCMS algorithm yields points that are on continuous curves (ridges), so there is no resolution issue to address.

We conclude by stating that one may extend the use of the SCMS algorithm beyond the analysis of galaxy point cloud data. For instance, Chen et al. (2014b) discuss how to apply the algorithm to pixelized image data; in particular, they modify the algorithm (calling it the weighted SCMS algorithm) to find intensity ridges caused by e.g. tidal tails. In addition, the authors discuss how one would incorporate the mass of a galaxy to achieve a better estimate of the local density as well as of the corresponding ridges.


Figure 16. Comparison of SCMS filaments to redMaPPer galaxy clusters, for z = 0.145-0.150. Shown are SDSS galaxies (black), putative filaments (blue), and redMaPPer galaxy clusters (red). As shown in Table 1, the redMaPPer clusters lie significantly closer to filaments than randomly selected points in the analysis window.


[Figure 17 panels: Distance To Filaments (z=0.1-0.15); Distance To Filaments (z=0.45-0.5).]

Figure 17. The cumulative distribution of the distance statistics from galaxies to filaments (blue) versus galaxy clusters to filaments (red) at different redshifts. We also display the distribution for random points (black) as a reference. The galaxy clusters are from the redMaPPer catalog. The unit of distance is degrees. We display only the first (z = 0.100-0.150) and the last (z = 0.450-0.500) sub-regions since the other sub-regions give similar results. The p-value for each region is given in Table 1.


APPENDIX A: PARAMETER SELECTION


Our version of SCMS has two key parameters, the smoothing bandwidth h and the threshold level $\tau$. In this section we show how we select optimal values for each.

The smoothing bandwidth h represents the amount by which we smooth the observed point cloud of galaxies when estimating the density p. One can choose h by applying prior knowledge or by letting h adapt to the sample. There is a large body of literature on the choice of bandwidth, e.g. Sheather et al. (2004) and Chacon et al. (2011, 2013). Among all methods, we recommend choosing h via

$$h = A_0 \times \left(\frac{1}{d+2}\right)^{\frac{1}{d+4}} n^{-\frac{1}{d+4}}\, \sigma_{\min}, \qquad (A1)$$

where $A_0$ is a constant that we discuss below, n is the sample size, d is the dimension (in our case d = 2) and $\sigma_{\min}$ is the minimal value of the standard deviation along each coordinate. Note that the reference rule (A1) will choose smaller h values as the sample size increases.
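
The reference rule (A1) can be evaluated directly from a point cloud. The following is a minimal sketch under stated assumptions: the function name `reference_bandwidth` and the toy grid of points are hypothetical, and we assume the population standard deviation is used for each coordinate (the sample standard deviation would differ only slightly for large n).

```python
import statistics

def reference_bandwidth(points, a0=0.4):
    """Bandwidth h from the reference rule (A1) for a list of d-dimensional points."""
    n = len(points)
    d = len(points[0])
    # sigma_min: smallest per-coordinate standard deviation
    # (assumption: population standard deviation).
    sigma_min = min(statistics.pstdev([p[k] for p in points]) for k in range(d))
    return a0 * (1.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4)) * sigma_min

# Toy two-dimensional point cloud on a regular 10 x 10 grid (made up for illustration).
points = [(float(i % 10), 2.0 * (i // 10)) for i in range(100)]
h_small = reference_bandwidth(points)

# Four copies of the same cloud: same sigma_min, larger n, hence a smaller h.
h_large = reference_bandwidth(points * 4)
```

Because h scales as $n^{-1/(d+4)}$, quadrupling the sample size in two dimensions reduces h by a factor of $4^{-1/6} \approx 0.79$.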

If $A_0$ is 1, (A1) corresponds to Silverman's rule (Silverman 1986). Silverman's rule selects h by minimizing the mean integrated squared error

$$\mathbb{E}\left(\int \left|\hat{p}(x) - p(x)\right|^2 dx\right) \qquad (A2)$$

when p is a Gaussian. When the data include filaments, p is no longer Gaussian and $A_0$ must be optimized as a free parameter. A smaller $A_0$ yields more filaments in a given dataset but more spurious filaments as well. There is no general rule for selecting $A_0$ since the optimality criterion involves the unknown density p. Figure A1 shows how varying $A_0$ affects the estimation of filamentary structures. Our results indicate that the optimal $A_0$ lies in the range [0.4, 0.8]. This is further confirmed by the true positive and false positive coverage in the N-body simulation (described in Section 3.2), as shown in Figure A2. In the N-body simulation, $A_0 = 0.4$ corresponds to h = 5 Mpc (actual value 4.82) while $A_0 = 0.8$ corresponds to h = 10 Mpc (actual value 9.65). Both values perform better than choices of h that are too large or too small (compare with the h = 2 Mpc and h = 15 Mpc cases). In our analyses of SDSS data, we adopt the value $A_0 = 0.4$.

Figure A1. ... redshift range z = 0.045-0.050. The gray dots are galaxies with density under $\tau = \sigma(\hat{p})$. The black dots are galaxies with density above $\tau$. In panels (a)-(c), blue curves are filaments detected by SCMS. In panel (d), we compare the filaments from (a)-(c).

Figure A2. The true positive (top row) and false positive (bottom row) coverage for different sample sizes. As can be seen, h = 5 and h = 10 (corresponding to the reference rule (A1) with $A_0 = 0.4$ and 0.8) are good choices for both true and false positive coverage. Note that h = 2 has good true positive coverage because it undersmooths the data, leading to numerous small filaments. Thus, it is more likely that there are some estimated filaments around true filaments, which increases the true positive coverage but also increases the false positive coverage (as true filaments may not appear around some estimated filaments).


Thresholding stabilizes the ridge-finding process, since random noise may cause small bumps in the estimated density field. However, if the threshold is set too high, we will remove useful information about the field. We recommend selecting the thresholding level according to the root mean square (RMS) of the density fluctuation:

$$\tau = \sigma(\hat{p}) = \left(\int_K \left(\hat{p}(x) - \bar{p}(K)\right)^2 dx\right)^{1/2}, \qquad (A3)$$

where K is the region we are interested in and $\bar{p}(K)$ is the average density in K. Note that thresholding is also utilized by the MMF (Aragon-Calvo et al. 2010a) and NEXUS (Cautun et al. 2013) filament- (and galaxy cluster-) detection algorithms.
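
A discrete analogue of the RMS thresholding rule (A3) can be sketched as follows. This is an illustration under explicit assumptions, not the procedure used for the analysis: the function names `rms_threshold` and `keep_dense` are hypothetical, the density estimate is assumed to have been evaluated on grid cells of equal area covering K, and the RMS is normalized by the number of cells (a normalization choice of this sketch).

```python
import math

def rms_threshold(grid_density):
    """RMS fluctuation of density values sampled on an equal-area grid over K."""
    m = len(grid_density)
    mean = sum(grid_density) / m  # discrete stand-in for the average density in K
    # Assumption: normalize by the number of grid cells.
    return math.sqrt(sum((p - mean) ** 2 for p in grid_density) / m)

def keep_dense(points, point_density, tau):
    """Keep only the points whose estimated density is at least tau."""
    return [pt for pt, p in zip(points, point_density) if p >= tau]

# Toy values (made up for illustration).
grid_density = [0.0, 0.0, 1.0, 3.0]
tau = rms_threshold(grid_density)

galaxies = [(150.0, 20.0), (160.0, 30.0), (165.0, 40.0)]
galaxy_density = [0.5, 1.3, 2.0]
kept = keep_dense(galaxies, galaxy_density, tau)
```

In this toy case the first galaxy falls below the threshold and is dropped before ridge finding, mirroring the gray (below $\tau$) and black (above $\tau$) dots described in the caption of Figure A1.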
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